Multimodal speech recognition with ultrasonic sensors

Authors

  • Bo Zhu
  • Timothy J. Hazen
  • James R. Glass
Abstract

Ultrasonic sensing of articulator movement is an area of multimodal speech recognition that has not been researched extensively. The widely researched audio-visual speech recognition (AVSR), which relies upon video data, requires a cumbersome setup and data-collection process and is computationally expensive because of image processing. In this thesis we explore the effectiveness of ultrasound as a more lightweight secondary source of information for speech recognition. We first describe the hardware systems that made simultaneous audio and ultrasound capture possible. We then discuss the new types of features that needed to be extracted; traditional Mel-Frequency Cepstral Coefficients (MFCCs) were not effective in this narrowband domain. Spectral analysis pointed to frequency-band energy averages, energy-band frequency midpoints, and spectrogram peak location vs. acoustic event timing as convenient features. Next, we devised ultrasonic-only phonetic classification experiments to investigate ultrasound's abilities and weaknesses in classifying phones. We found that several acoustically similar phone pairs were distinguishable through ultrasonic classification, as were several same-place consonants. We also compared classification metrics across phonetic contexts and speakers. Finally, we performed multimodal continuous digit recognition in the presence of acoustic noise. We found that the addition of ultrasonic information reduced word error rates by 24-29% over a wide range of acoustic signal-to-noise ratios (SNR), from clean to 0 dB. This research indicates that ultrasound has the potential to be a financially and computationally cheap noise-robust modality for speech recognition systems.

Thesis Supervisor: James R. Glass
Title: Principal Research Scientist

Thesis Co-Supervisor: Karen Livescu
Title: Research Assistant Professor
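The abstract names frequency-band energy averages as one of the features that worked in this narrowband domain. As a minimal sketch of that kind of feature, the function below frames a signal, computes a power spectrogram, and averages energy within equal-width frequency bands per frame. All parameters (FFT size, hop, band count, sampling rate) are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def band_energy_averages(signal, fs, n_fft=256, hop=128, n_bands=4):
    """Frame the signal, compute a power spectrogram, and average
    spectral energy within n_bands equal-width frequency bands."""
    # Frame the signal with a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrogram: shape (n_frames, n_fft // 2 + 1)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Split the frequency axis into contiguous bands and average
    # the energy in each band, per frame
    bands = np.array_split(np.arange(spec.shape[1]), n_bands)
    return np.stack([spec[:, idx].mean(axis=1) for idx in bands], axis=1)

# Toy usage: a 40 kHz tone sampled at 192 kHz. With four bands spanning
# 0-96 kHz, its energy should concentrate in the second band (24-48 kHz).
fs = 192_000
t = np.arange(0, 0.01, 1 / fs)
feat = band_energy_averages(np.sin(2 * np.pi * 40_000 * t), fs)
```

The result is one low-dimensional energy vector per frame, which is far cheaper to compute than the image-processing pipeline AVSR requires.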

Similar resources

Multimodal Silent Speech Interface based on Video, Depth, Surface Electromyography and Ultrasonic Doppler: Data Collection and First Recognition Results

Silent Speech Interfaces use data from the speech production process, such as visual information of face movements. However, using a single modality limits the amount of available information. In this study we start to explore the use of multiple data input modalities in order to acquire a more complete representation of the speech production model. We have selected 4 non-invasive modalities – ...

Combined Hand Gesture — Speech Model for Human Action Recognition

This study proposes a dynamic hand gesture detection technology to effectively detect dynamic hand gesture areas, and a hand gesture recognition technology to improve the dynamic hand gesture recognition rate. Meanwhile, the corresponding relationship between state sequences in hand gesture and speech models is considered by integrating speech recognition technology with a multimodal model, thu...

Exploiting Nonacoustic Sensors for Speech Enhancement*

Nonacoustic sensors such as the general electromagnetic motion sensor (GEMS), the physiological microphone (P-mic), and the electroglottograph (EGG) offer multimodal approaches to speech processing and speaker and speech recognition. These sensors provide measurements of functions of the glottal excitation and, more generally, of the vocal tract articulator movements that are relatively immune ...

Noise Suppression with Non-Air-Acoustic Sensors

Nonacoustic sensors such as the general electromagnetic motion sensor (GEMS), the physiological microphone (P-Mic), and the electroglottograph (EGG) offer multimodal approaches to speech processing and speaker and speech recognition. These sensors provide measurements of functions of the glottal excitation and, more generally, of the vocal tract articulator movements that are relatively immune ...

Multimodal Information Fusion

Humans interact with each other using different modalities of communication. These include speech, gestures, documents, etc. It is only natural that human computer interaction (HCI) should facilitate the same multimodal form of communication. In order to capture this information, one uses different types of sensors, i.e., microphones to capture the audio signal, cameras to capture live video im...


Journal title:

Volume   Issue

Pages  -

Publication date: 2007